System and method for speech recognition using tonal modeling

ABSTRACT

A system and method for speaker-independent speech recognition is provided that integrates spectral and tonal analysis in a sequential architecture. The system analyzes the spectral content of a spoken syllable, or group of syllables (18), and generates a spectral score for each of a plurality of predicted syllables (46, 22). Time alignment information (36) for the predicted syllable(s) is then sequentially passed to a tonal modeling block (14), which performs an iterative fundamental frequency contour estimation for the spoken syllable(s). The tones of adjacent syllables, as well as the rate of change of the tonal information, are then used to generate a tonal score for each of the plurality of predicted syllables. The tonal score (34) is then arithmetically combined (40) with the spectral score (32) in order to generate an output prediction.

CROSS REFERENCE TO RELATED APPLICATION

This application is a 371 of PCT/US00/32230, filed on Nov. 22, 2000, which claims the benefit of U.S. Provisional Application No. 60/167,172, filed on Nov. 23, 1999.

BACKGROUND

1. Technical Field

The present invention is directed to the field of speech recognition. More specifically, the invention provides a speaker-independent speech recognition system and method for tonal languages in which a spectral score is combined, sequentially, with a tonal score to arrive at a best prediction for a spoken syllable.

2. Description of the Related Art

Recently, there have been many advancements in speech recognition systems. Most of these systems, however, are developed for western languages such as English, which are non-tonal, as distinguished from many eastern languages such as Chinese, which are tonal. In a tonal language, the tone of the speech is related to its meaning, and therefore it is insufficient to simply analyze the spectral content of the spoken syllable(s), as can be done in analyzing non-tonal languages. A tonal language typically has four to nine tones. For example, these tones are classified into “high,” “rising,” “dip,” or “falling” in Mandarin Chinese, which has four tones. Explicit recognition of these tones is difficult, however, since different speakers have different speaking characteristics. In languages such as Chinese, tones are characterized by features such as the fundamental frequency (F0) values and corresponding contour shapes. These values and shapes are difficult to capture and properly analyze for speaker-independent recognition because the absolute value of F0 varies greatly between speakers. For example, the high tone of a low-pitch speaker can be the same as or similar to the low tone of a high-pitch speaker.

Several known speech recognition systems for tonal languages are described in CN 1122936, U.S. Pat. No. 5,787,230, CN 1107981, CN 1127898, U.S. Pat. No. 5,680,510, U.S. Pat. No. 5,220,639, WO 97/40491, WO 96/10248, and U.S. Pat. No. 5,694,520. Many of these systems, however, rely on the absolute value of the syllable's fundamental frequency (F0) in order to ascertain the proper tone, and thus fail to properly discriminate between speakers having differing tonal characteristics. These systems typically must be “trained” for a particular speaker prior to proper operation. In addition, each of these systems utilizes a parallel processing architecture that prohibits an integrated analysis of the spectral and tonal information, thus further limiting their usefulness in a speaker-independent application.

SUMMARY

A system and method for speaker-independent speech recognition is provided that integrates spectral and tonal analysis in a sequential architecture. The system analyzes the spectral content of a spoken syllable (or group of syllables) and generates a spectral score for each of a plurality of predicted syllables. Time alignment information for the predicted syllable(s) is then sequentially passed to a tonal modeling block, which performs an iterative fundamental frequency (F0) contour estimation for the spoken syllable(s). The tones of adjacent syllables, as well as the rate of change of the tonal information, are then used to generate a tonal score for each of the plurality of predicted syllables. The tonal score is then arithmetically combined with the spectral score in order to generate an output prediction.

An aspect of the present invention provides a speech recognition method that may include the following steps: (a) receiving a speech waveform; (b) performing a spectral analysis of the speech waveform and generating a set of syllabic predictions, each syllabic prediction including one or more predicted syllables, wherein the set of syllabic predictions includes a spectral score and timing alignment information of the one or more predicted syllables; (c) sequentially performing a tonal analysis of the input speech waveform using the timing alignment information and generating tonal scores for each of the syllabic predictions; and (d) combining the spectral score with the tonal score for each of the syllabic predictions in order to generate an output prediction.

Another aspect of the invention provides a speech recognition system that includes several software and/or hardware implemented blocks, including: (a) a spectral modeling block that analyzes a speech waveform and generates a plurality of predicted syllables based upon the spectral content of the speech waveform, wherein each of the predicted syllables includes an associated spectral score and timing alignment information indicating the duration of the syllable; and (b) a tonal modeling block that sequentially analyzes the speech waveform using the timing alignment information from the spectral modeling block and generates a plurality of tone scores based upon the tonal content of the speech waveform for each of the predicted syllables.

Still another aspect of the invention provides a system for analyzing a speech waveform. This system preferably includes a spectral modeling branch for generating a spectral score, and a tonal modeling branch for generating a tonal score. The spectral modeling branch generates timing alignment information that indicates the beginning and ending points for a plurality of syllables in the speech waveform and provides this timing alignment information to the tonal modeling branch in order to sequentially analyze the speech waveform.

An additional method according to the invention provides a method for analyzing a speech waveform carrying a plurality of syllables. This method preferably includes the following steps: (a) performing a spectral analysis on the speech waveform and generating one or more spectral scores for each syllable; (b) performing a tonal analysis on the speech waveform and generating one or more tonal scores for each syllable, wherein the tonal scores are generated by comparing the fundamental frequencies of two or more adjacent syllables; and (c) combining the spectral scores with the tonal scores to produce an output prediction.

Still another, more specific method according to the invention provides a method of recognizing tonal information in a speech waveform. This method preferably includes the following steps: (a) generating timing alignment information for a plurality of syllables in the speech waveform; (b) determining a center point within each syllable of the speech waveform using a beginning and ending point specified by the timing alignment information; (c) determining the energy of the syllable at the center point; (d) generating an analysis window for each syllable, wherein the analysis window is centered at the center point and is bounded on either side of the center point by calculating the points at which the energy of the syllable has decreased to a first predetermined percentage of the energy at the center point; (e) computing a fundamental frequency contour within the analysis window; (f) extracting one or more tonal features from the fundamental frequency contour; and (g) generating a plurality of tonal scores for each syllable based on the one or more extracted tonal features.

Other aspects of the invention, not set forth specifically above, will be apparent to one of skill in this field upon reading the description of the drawings, set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a speaker-independent speech recognition system according to the present invention;

FIG. 2 is a flowchart depicting a series of steps for F0 contour estimation according to the present invention;

FIG. 3 is an example F0 contour plot generated by the methodology of the present invention, depicting three spoken syllables; and

FIG. 4 is a timing diagram depicting three spoken syllables including tonal information.

DETAILED DESCRIPTION

Turning now to the drawing figures, FIG. 1 is a block diagram of a speaker-independent speech recognition system according to the present invention. This system includes two branches (or paths): an upper branch 12, which performs spectral modeling of an input waveform and produces a spectral score 32, and a lower branch 14, which performs tonal modeling based on the input waveform, and also based upon information received from the upper branch 12, and produces a tonal score 34. A combination block 40 then combines the spectral score 32 with the tonal score 34 in order to generate a best output prediction 42 for the spoken syllable(s). In this fashion, the present invention provides a sequential architecture for speech recognition in which information from the spectral analysis is used in the tonal analysis to provide a more robust result.

Not explicitly shown in FIG. 1 is front-end hardware (or software) for generating the input waveform 16, and back-end hardware (or software) for using the output prediction 42. This front-end hardware may include a microphone, an analog-to-digital converter, and a digital signal processor (DSP), depending upon the application of the system. For example, the system 10 could be integrated into a variety of applications, such as a general-purpose speech recognition program, a telephone, cellular phone, or other type of electronic appliance, or any other type of software application or electronic device that may require speaker-independent speech recognition capability. Preferably, however, the input waveform 16 is a digital waveform.

The spectral modeling branch 12 includes a spectral analysis block 18, a feature extraction block 20, a model scoring block 22, and an N-best search block 24. The model scoring block 22 receives information from a model database 46, and the N-best search block 24 receives information from a vocabulary database 48.

The spectral analysis block 18 receives the input waveform 16 and performs a frequency-domain spectral analysis of the spoken syllable(s). Example spectral analyses include a fast Fourier transform (FFT), a mel frequency cepstral coefficient (MFCC) analysis, or a linear prediction coefficient (LPC) analysis. Regardless of the exact type of spectral analysis performed, the spectral analysis block 18 generates a sequence of frames, each including a multi-dimensional vector that describes the spectral content of the input waveform 16.
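
The patent does not supply a reference implementation or frame geometry, but the framing step can be illustrated with a minimal sketch. The Python sketch below windows the waveform into overlapping frames and uses FFT magnitudes as the multi-dimensional spectral vector; the 25 ms / 10 ms frame geometry at 16 kHz and the Hamming window are illustrative assumptions, and an MFCC or LPC analysis could be substituted for the FFT.

```python
import numpy as np

def spectral_frames(waveform, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and return each frame's
    FFT magnitude spectrum as its multi-dimensional spectral vector.
    frame_len=400 and hop=160 (25 ms / 10 ms at 16 kHz) are assumed
    values, not parameters taken from the patent."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len] * window
        # An MFCC or LPC analysis could replace the FFT magnitude here.
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)
```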

The sequence of frames from the spectral analysis block 18 is then provided to the feature extraction block 20. The feature extraction block analyzes the multi-dimensional vector data in the sequence of frames and generates additional dimensionality data that further describes certain features of the input waveform 16. For example, the feature extraction block 20 may compute a differential between two adjacent frames for each of the dimensions in the vector, it may then compute a differential of the computed differential, or it may compute energy or some other related calculation. These calculations relate to certain features of the spoken syllables that can be further utilized by the model scoring block 22 in order to properly predict the actual speech.
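
As a rough sketch of this step, the following hypothetical helper appends first- and second-order frame differentials and a log-energy term to each spectral vector; the exact feature set is an assumption, since the text names differentials and energy only as examples.

```python
import numpy as np

def add_deltas(frames):
    """Append first- and second-order frame-to-frame differentials
    ("delta" and "delta-delta") and a per-frame log-energy term to
    each spectral vector."""
    delta = np.diff(frames, axis=0, prepend=frames[:1])   # differential
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])    # differential of the differential
    energy = np.log(np.sum(frames ** 2, axis=1, keepdims=True) + 1e-10)
    return np.hstack([frames, delta, delta2, energy])
```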

The multi-dimensional vector data from the spectral analysis block 18 and the additional computations from the feature extraction block 20 (collectively referred to as the feature vector) are then provided to the model scoring block 22. The model scoring block may use a Gaussian distribution function in order to compute a probability result that the feature vector corresponds to a particular spectral model of some syllable (or syllables). At this point it is important to note that the system described herein could be configured at a variety of levels of granularity. Thus, for example, the system may be configured to analyze one letter at a time, one syllable at a time, a group of syllables at a time, or entire words at a time. Regardless of the granularity of the analysis, however, the basic steps and functions set forth would be the same.
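
The text says only that the scoring "may use a Gaussian distribution function." One minimal reading, assuming a diagonal covariance for simplicity (the covariance structure is not specified here), is a per-frame log-likelihood:

```python
import numpy as np

def gaussian_log_score(feature_vec, mean, var):
    """Log-likelihood of a feature vector under a Gaussian with the
    given mean and per-dimension variance vector; higher scores mean a
    closer match to the spectral model. Diagonal covariance is a
    simplifying assumption, not a requirement from the patent."""
    diff = feature_vec - mean
    return -0.5 * np.sum(np.log(2 * np.pi * var) + diff ** 2 / var)
```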

The model scoring block 22 utilizes data from a model database 46 in computing its probabilities for a particular set of input data (feature vector). The model database preferably includes a Hidden Markov Model (HMM), although other types of models could also be utilized. For more information on the HMM, see Robustness in Automatic Speech Recognition, by Hisashi Wakita, pp. 90–102. Using the input data from the spectral analysis block 18 and the feature extraction block 20, the model scoring block develops a prediction (or score) for each entry in the model database. Higher scores are associated with more likely spectral models, and lower scores with less likely models.

The scores for each of the models from the model scoring block 22 are then passed to the N-best search block 24, which compares these scores to data stored within a vocabulary database in order to derive a set of predictions for the most likely spoken syllables (or letters, or words, depending on the application). The vocabulary database is typically organized into a series of words that include syllables and the tones associated with those syllables, although other semantic organizations for the vocabulary are certainly possible. If the vocabulary is on a word level, then the scores at the frame level (or syllable level) may be combined by the N-best search block 24 prior to comparison to the data in the vocabulary database 48.

The N-best search block 24 provides two outputs 32, 36. The first output is a set of spectral scores 32 for the most likely syllables (or words or sentences), as determined by comparing the model scoring information to the data stored in the vocabulary database 48. These spectral scores 32 are preferably described in terms of probability values, and are then provided to the combination block 40 for combination with the tonal scores 34.

For each of the set of most likely syllables, the N-best search block 24 also provides time alignment information 36, which is provided to the F0 estimation block 26 of the tonal analysis branch 14. The time alignment information 36 includes information as to where (in time) a particular syllable begins and ends. This information 36 also includes the identity of the predicted syllables (and their associated tones) as determined by the N-best search block 24. Thus, for example, if the N-best search block is configured to predict the three most likely spoken syllables, then the time alignment information 36 passed to the F0 estimation block 26 would include beginning and ending timing information for each of the three syllables, the identity of each syllable, and its tone.

In the tonal modeling section 14, the input speech waveform 16 undergoes analysis by an F0 estimation block 26, a feature extraction block 28, and a model scoring block 30. In the following description, it is illustrative to examine FIG. 1, as well as FIG. 2, which is a flowchart depicting a series of steps for F0 contour estimation 26 according to the present invention.

The general operation of the tonal analysis branch 14 is as follows. The input waveform 16 is input to the fundamental frequency (F0) estimation block 26, which also receives the time alignment information 36 from the N-best search block 24. The F0 estimation block 26 uses the input waveform and the time alignment information in order to output an F0 contour 44, as further described below. The F0 contour 44 determination is preferably based on the Average Magnitude Difference Function (AMDF) algorithm. Following the F0 contour determination, the system then extracts numerous features from the F0 contour of the input waveform using a feature extraction block 28, such as the ratio of the average F0 frequencies of adjacent syllable pairs and the slope of the first-order least squares regression line of the F0 contour. These features are then input to a statistical model 30 that preferably uses a two-dimensional full-covariance Gaussian distribution to generate a plurality of tone scores 34 for each of the predicted syllables from the N-best search block 24. The tone score 34 is combined, preferably linearly, with the spectral score 32 from the spectral analysis branch 12 for each of the predicted syllables in order to arrive at a set of final scores that correspond to an output prediction 42.

The tonal modeling section 14 is now described in more detail.

1. F0 Estimation Algorithm

FIG. 2 is a flowchart depicting a series of steps for F0 contour estimation 26 according to the present invention. The F0 estimation algorithm involves an initial second-order lowpass filtering operation 110, followed by a methodology based on the AMDF algorithm. The basic description is as follows.

1.1 Low Pass Filtering

The second-order lowpass filter step 110 on the input waveform 16 is preferably described by the following transfer function:

$H(z) = \frac{1}{1 - 1.6z^{-1} + 0.64z^{-2}}$

This operation will eliminate high-frequency noise in the input signal. Other transfer functions, and other types of filtering operations, could also be executed at this stage of the methodology.
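
Because the transfer function is given explicitly, this step maps directly onto a recursive (IIR) difference equation. The sketch below uses SciPy's lfilter, which is an implementation choice rather than anything mandated by the patent:

```python
from scipy.signal import lfilter

def lowpass(waveform):
    """Second-order lowpass filter H(z) = 1 / (1 - 1.6 z^-1 + 0.64 z^-2),
    i.e. y[n] = x[n] + 1.6*y[n-1] - 0.64*y[n-2], which suppresses
    high-frequency noise ahead of F0 estimation."""
    b = [1.0]              # numerator coefficients of H(z)
    a = [1.0, -1.6, 0.64]  # denominator coefficients of H(z)
    return lfilter(b, a, waveform)
```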

1.2 Alignment

Following the low-pass filtering step 110, the F0 estimation block 26 then receives the time alignment information 36 at step 112 from the N-best search block 24 of the spectral modeling branch 12. As described previously, this information 36 includes beginning and ending timing information for each of the predicted syllables from the spectral analysis, and also includes the identity of the predicted syllables and their corresponding tones. The primary purpose of the tonal modeling block is to predict which of these spectral analysis predictions is most likely, given an analysis of the tonal information in the actual input waveform 16. A center point for each syllable can then be identified by determining the point of maximum energy within the syllable.

1.3 AMDF

Following step 112, the F0 estimator block 26 computes the fundamental frequency contour for the entire frame at step 114 using the AMDF algorithm, the frame corresponding to a particular prediction (which could be a letter, syllable, word, or sentence, as discussed above). This step also computes the average frequency F_(AV) for the entire frame of data. The AMDF algorithm produces an estimate of the fundamental frequency using an N-data-point window of the lowpass-filtered waveform 16 that corresponds to the type of prediction. In this method, a difference function is computed at each frame where a value of the fundamental is required. The equation for the difference function is as follows:

$y_{n}(k) = \sum_{m = 0}^{N} \left| x(n + m) - x(n + m - k) \right|$

in which y_(n)(k) dips sharply for k = P, 2P, . . . , where P is the fundamental period. Since the period is the inverse of the frequency, the fundamental frequencies can be derived by determining the periodicity of the waveform.
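
A direct transcription of this difference function, assuming x is a NumPy array and the caller guarantees n ≥ k_max so the lagged index stays in range:

```python
import numpy as np

def amdf(x, n, k_min, k_max, N):
    """Average Magnitude Difference Function y_n(k) over the N+1 samples
    starting at index n of waveform x, evaluated for lags k_min..k_max.
    The curve dips sharply at multiples of the fundamental period."""
    m = np.arange(N + 1)
    return np.array([np.sum(np.abs(x[n + m] - x[n + m - k]))
                     for k in range(k_min, k_max + 1)])
```

In the simplest pass, the period estimate is the lag of the global minimum, `P = k_min + int(np.argmin(amdf(x, n, k_min, k_max, N)))`, giving F0 = fs/P at sample rate fs; the passes described next refine this choice.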

Thus, each local minimum of this difference function is associated with a multiple of the fundamental period. In general, the fundamental period is identified at the point where the global minimum within the N-point window occurs. However, due to various distortions in the speech waveform and effects such as laryngealization, this is often not the case. In fact, particularly at vowel-consonant transition boundaries, the global minimum can occur at half or integer multiples of the fundamental period, and therefore estimation of the contour is highly prone to error. These halving or doubling errors manifest as large deviations in the fundamental contour when, in fact, the true contour itself is smooth, with only small, gradual changes. So, in order to ameliorate such errors, it is necessary to employ other means to choose the correct local minimum, which represents the fundamental period. In this algorithm, multiple passes are subsequently conducted through the waveform in order to choose the correct minimum that corresponds to the fundamental period. These additional steps are described below with reference to steps 116–122.

1.4 Iterative F0 Re-Estimation

The actual F0 contour estimation set forth in FIG. 2 consists of several passes through the entire spoken utterance (i.e., all the data present in the input waveform 16). This is done in order to reduce the number of halving or doubling errors. These errors are more likely at the edges of the vowel, that is, at the consonant-vowel transition boundaries. Also, if voicing is absent, the estimation of F0 is meaningless and the value of F0 should be ignored. In the absence of an accurate alignment of the vowel-consonant boundary, it is necessary to incorporate automatic voicing detection into the F0 estimation algorithm.

1.4.1 Islands of Reliability

In order to reduce these halving and doubling errors, the present invention introduces the concept of “islands of reliability.” These islands of reliability are first computed in step 116 of the preferred methodology utilizing the time alignment information 36 received at step 112. The point of maximum energy near the center of each syllable has been previously obtained in step 112 from an alignment provided by the spectral analysis branch. The speech segment in which the energy remains above P percent of the maximum is then marked as an island of reliability in step 116. Here, the value of “P” is a predetermined amount and may vary from application to application. The purpose of the island of reliability is to provide a speech segment over which the basic F0 estimator, or AMDF algorithm, produces very reliable results. FIG. 3 sets forth a portion of the F0 contour 200 for three spoken syllables 202, 204, 206, in which the initial island of reliability for each syllable is shown as 208.

For this first pass, at a fixed interval of frames, the difference function set forth above is computed whenever the frame falls within an island of reliability. The fundamental period pertaining to that frame is chosen as the global minimum of the difference function. Any local minimums are ignored at this stage of the process. Then, an overall average F0 is computed from all such values. This forms an initial estimate that indicates the average pitch, F_(AV), of the speaker's voice; the final fundamental frequency contour should reside in this vicinity.
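
A minimal sketch of the island-marking step: starting at the energy peak supplied by the alignment, grow outward while the frame energy stays above P percent of the peak. The function name and the percent=50 default are assumptions for illustration; the same routine with the looser R percent cutoff can serve the island expansion of step 120 described below.

```python
import numpy as np

def mark_island(energy, center, percent=50.0):
    """Grow outward from the syllable's energy peak (frame index
    `center`) while per-frame energy stays above `percent` percent of
    the peak; returns inclusive frame bounds of the island. The
    threshold is application-dependent per the patent; 50 is only an
    illustrative value."""
    threshold = energy[center] * percent / 100.0
    left = right = center
    while left > 0 and energy[left - 1] >= threshold:
        left -= 1
    while right < len(energy) - 1 and energy[right + 1] >= threshold:
        right += 1
    return left, right
```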

1.4.2 F0 Estimation

As a second pass through the waveform, the F0 contour is established within the islands of reliability, but this time both global and local minimums are considered. Again the difference function is computed for all frames that lie within these islands. Now, in order to determine the true pitch contour, two sources are utilized to make each estimate from the difference function y_(n)(k), as defined above. The algorithm searches for (i) the global minimum K_(G) of the difference function and (ii) the local minimum K_(L) that is closest to the period of the average fundamental F_(AV), as computed in the first pass above. The global minimum K_(G) in (i) is chosen only if its difference-function value is less than that of the local minimum in (ii) scaled by a predetermined threshold; otherwise, K_(L) in (ii) is chosen. Therefore,

$F_{0} = \begin{cases} 1/K_{G} & \text{if } y_{n}(K_{G}) < \delta \times y_{n}(K_{L}) \\ 1/K_{L} & \text{otherwise} \end{cases}$

In this manner, the F0 contour is predicted from left to right of the utterance at the marked islands of reliability. The reason that K_(L) is chosen over K_(G), unless y_(n)(K_(G)) is much less than y_(n)(K_(L)), is that a typical speaker's tone cannot change very rapidly; thus it is more likely that the correct F0 calculation is based on the local minimum that is closest to the average fundamental frequency for the entire data frame.
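
The selection rule can be sketched as follows, given an AMDF curve y over lags k_min..k_max, the first-pass average F_(AV), and the sample rate fs; the delta=0.7 default is an assumed value, as the patent leaves the threshold unspecified.

```python
import numpy as np

def select_period(y, k_min, f_av, fs, delta=0.7):
    """Second-pass period selection from an AMDF curve y indexed by lags
    k_min..k_min+len(y)-1. K_G is the global minimum; K_L is the local
    minimum whose lag is closest to the period of the average
    fundamental F_AV. delta=0.7 is an assumed threshold value."""
    k_g = k_min + int(np.argmin(y))
    idx = np.arange(1, len(y) - 1)
    local = idx[(y[idx] < y[idx - 1]) & (y[idx] < y[idx + 1])]  # local minimums
    if len(local) == 0:
        return k_g
    p_av = fs / f_av  # period (in samples) of the average fundamental
    k_l = k_min + local[np.argmin(np.abs(local + k_min - p_av))]
    # Choose K_G only when its dip is markedly deeper than K_L's.
    return k_g if y[k_g - k_min] < delta * y[k_l - k_min] else k_l
```

The returned lag K converts to frequency as F0 = fs/K for a waveform sampled at fs.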

1.4.3 Island Expansion

The next pass through the speech data involves the determination of the F0 contour from each boundary of the initial islands of reliability to points on either side of the islands at which the energy of the waveform drops below R percent of the maximum energy within the island. In this pass 120, the boundary at which voicing in the vowel terminates is determined. This is done by examining the data frames to the left or right of the initial island boundaries and then assuming that, when the energy in the frame data drops below R percent of its maximum value at the vowel center in the initial island of reliability, the F0 estimate is no longer reliable. This is due to the absence of voicing, and so the F0 values are ignored beyond this cutoff point. In this manner, the initial islands of reliability are expanded to the right and left of the initial boundaries. FIG. 3 sets forth a portion of the F0 contour 200 for three spoken syllables 202, 204, 206, in which the initial island of reliability for each syllable is shown as 208 and the expanded island of reliability for each syllable is shown as 210.

At step 122, the fundamental frequency contour F0 is then recomputed over the expanded island of reliability. For the F0 contour to the right of each island of reliability 208, the contour is estimated from left to right, and vice versa for the F0 contour to the left of each island. Again, each time the difference function is computed, two particular locations are marked. The method searches for (i) the global minimum K_(G) and (ii) the local minimum K_(L) whose occurrence is most proximate to the fundamental period value to the immediate left of the current estimated value. The global minimum K_(G) in (i) is chosen only if its difference-function value is less than that of the local minimum in (ii) scaled by the predetermined threshold value δ; otherwise, K_(L) in (ii) is chosen as the fundamental period.
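
Outside the initial islands, the anchor for the local-minimum candidate therefore changes: K_(L) is the local minimum closest to the period estimated for the neighbouring frame just inside the contour, rather than to the utterance-wide average. A sketch of that variant, under the same assumed δ:

```python
import numpy as np

def select_period_tracking(y, k_min, prev_period, delta=0.7):
    """Variant of the selection rule used while extending the contour
    beyond an island: K_L is the local minimum closest to the period
    estimated for the adjacent, already-estimated frame (prev_period).
    delta=0.7 is again an assumed threshold."""
    k_g = k_min + int(np.argmin(y))
    idx = np.arange(1, len(y) - 1)
    local = idx[(y[idx] < y[idx - 1]) & (y[idx] < y[idx + 1])]
    if len(local) == 0:
        return k_g
    k_l = k_min + local[np.argmin(np.abs(local + k_min - prev_period))]
    return k_g if y[k_g - k_min] < delta * y[k_l - k_min] else k_l
```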

These steps 120, 122 are very similar to the F0 estimation within the islands of reliability in step 118. In a similar fashion, the procedure continues from right to left to estimate the fundamental frequency values to the left of the islands of reliability, beginning at the left boundary of each of these islands and terminating when the energy falls below R percent of the maximum energy within the syllable.

This method uses the global minimum of the difference function y_(n)(k) as an estimate of the fundamental period only if that value is not very far from previous estimates of the pitch contour. In many cases the minimum calculations in (i) and (ii) will coincide at the same point, and there is no question of where the fundamental period occurs. The aim is to produce a fundamental contour that is as smooth as possible, since a contour with a minimum number of discontinuities and sudden changes is likely to be closer to the true contour.

1.5 Median Filtering

As an additional measure to produce a smoother contour, a five-point median filter is applied in step 124. This operation smooths the contour data and produces the F0 contour output 44, which is then supplied to the feature extraction block 28 of the tonal analysis branch 14.
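
The smoothing step maps directly onto a standard five-point median filter; for example, using SciPy:

```python
from scipy.signal import medfilt

def smooth_contour(f0_contour):
    """Apply the five-point median filter of step 124; median filtering
    removes isolated halving/doubling spikes while preserving the
    overall shape of the contour."""
    return medfilt(f0_contour, kernel_size=5)
```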

2. Tone Feature Extraction and Modeling Algorithm

After the F0 contour has been computed, features are extracted pertaining to tone information for generating a tonal score, which will eventually be combined with the spectral score in order to arrive at a final output prediction 42. These steps are carried out by the feature extraction block 28 and the model scoring block 30. The tone model is preferably based on a two-dimensional full-covariance Gaussian model, although other tonal models could also be used. During training of this type of model, a separate sub-model is built for each unique combination of tone pairs. Each syllable in the vocabulary database 48 is associated with a tone of its own. Therefore, for a vocabulary of N syllables, there is a total of N squared sub-models.

The tone model preferably consists of two dimensions: (1) the ratio of the average tone frequency of a syllable to the average tone frequency of the following syllable (in order to compare the tone pairs); and (2) the slope of the fundamental frequency F0, as estimated by a regression line over one of the syllables. In (1), the tone frequency is estimated by averaging the F0 frequencies within each syllable, and the ratio of adjacent syllables is then taken. In (2), the slope of the contour at the syllable is estimated by a first-order least squares linear regression line. These two features are provided by the feature extraction block 28, operating on the output F0 contour 44 from the F0 estimation block 26, and are then provided to the model scoring block 30, which derives a Gaussian score 34 for each pair of adjacent syllables. By scoring the tonal information based on adjacent tones, the present invention overcomes a primary disadvantage of known systems, which only derive tonal information based on the absolute value of the fundamental frequency F0 contour and do not take into account adjacent tones. This advantage of the present invention enables use in a speaker-independent environment.
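
A compact sketch of the two-dimensional tone feature and its Gaussian scoring. The helper names and the index-based interface are hypothetical; the ratio and slope definitions follow the text, and the sub-model mean and covariance would come from training.

```python
import numpy as np

def tone_features(f0, t, syl, nxt):
    """Two-dimensional tone feature for an adjacent syllable pair:
    (1) ratio of the syllables' average F0 values, and (2) least-squares
    slope of the first syllable's F0 contour. `syl` and `nxt` are
    (start, end) frame-index pairs into the contour."""
    seg, seg_t = f0[syl[0]:syl[1]], t[syl[0]:syl[1]]
    ratio = np.mean(seg) / np.mean(f0[nxt[0]:nxt[1]])
    slope = np.polyfit(seg_t, seg, 1)[0]  # first-order regression slope
    return np.array([ratio, slope])

def tone_log_score(feat, mean, cov):
    """Log-likelihood under the two-dimensional full-covariance Gaussian
    sub-model for one tone pair; `mean` and `cov` come from training."""
    diff = feat - mean
    return -0.5 * (np.log((2 * np.pi) ** 2 * np.linalg.det(cov))
                   + diff @ np.linalg.solve(cov, diff))
```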

Having computed the spectral score 32 for a particular set of predicted syllables from the spectral branch 12, and having computed the corresponding tonal score 34 for the same set of predicted syllables from the tonal branch 14, the system shown in FIG. 1 then combines these scores in a combination block 40, as further discussed below, in order to derive a final output prediction 42.

The combination block 40 need only take into account those tone scores that correspond to syllable pairs whose hypothesized predictions differ across the N hypotheses. In other words, if, for all N hypotheses, the syllable pair in question is hypothesized the same in each case, and hence yields the same tone score, then that score is ignored. Effectively, the combination block 40 keeps only those tone scores for which the hypothesized labels differ in at least one of the N hypotheses. For each hypothesis, these tone scores are averaged over the number of syllable pairs whose tone scores are nonzero to form the tone score St. For example, in the following example the number of hypotheses is N=3, and the αi are hypothesized syllables in an utterance with five syllables:

-   Hyp 1: α1 α2 α3 α4 α5
-   Hyp 2: α6 α7 α3 α4 α5
-   Hyp 3: α6 α7 α3 α4 α6

Only the first two and the final syllables differ in their hypothesized labels across the three hypotheses; the rest are identical. Therefore, if s(αiαj) represents the tone score 34 for the syllable pair αi and αj, then St = (s(α1α2) + s(α2α3) + s(α4α5))/3 for Hyp 1, St = (s(α6α7) + s(α7α3) + s(α4α5))/3 for Hyp 2, and St = (s(α6α7) + s(α7α3) + s(α4α6))/3 for Hyp 3. Finally, St is scaled by a predetermined scaling factor β and is subsequently combined with the spectral score Ss for the utterance to form the final score S_(TOTAL):

S_(TOTAL) = Ss + βSt

This final score is then used to reorder the hypotheses to produce a new N-best list as a final output prediction 42.
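
The rescoring logic can be sketched as follows; `pair_score` stands in for a lookup into the trained tone sub-models, and beta=0.5 is an assumed weight (the patent only calls it a predetermined scaling factor).

```python
import numpy as np

def rescore(hyps, spectral_scores, pair_score, beta=0.5):
    """Combine scores over an N-best list. Only adjacent-syllable pairs
    whose labels differ in at least one hypothesis contribute; each
    hypothesis's tone score St is the average over those pairs, and the
    final score is Ss + beta*St."""
    n = len(hyps[0])
    # Pairs hypothesized identically in every hypothesis are ignored.
    active = [i for i in range(n - 1)
              if len({(h[i], h[i + 1]) for h in hyps}) > 1]
    finals = []
    for h, ss in zip(hyps, spectral_scores):
        st = np.mean([pair_score(h[i], h[i + 1]) for i in active]) if active else 0.0
        finals.append(ss + beta * st)
    return finals  # reorder the N-best list by these final scores
```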

FIG. 4 is a timing diagram depicting three spoken syllables including tonal information. This figure illustrates a sequence of three syllables, x(3), y(1), and z(2), where x, y, and z denote the syllables and the digits inside the parentheses, (3), (1), and (2), denote the tones of the respective syllables. When the tone-recognition component of the present invention computes the probability of having Tone 3 between t1 and t2 and having Tone 1 between t2 and t3, it utilizes the pitch information between t1 and t3. This strategy has two advantages: (1) it reduces the sensitivity of the recognition software to the different speaking characteristics of the speakers; and (2) it captures co-articulatory effects of two adjacent syllables and tones.

Having described in detail the preferred embodiments of the present invention, including the preferred methods of operation, it is to be understood that this operation could be carried out with different elements and steps. The preferred embodiments are presented only by way of example and are not meant to limit the scope of the present invention, which is defined by the following claims.

1. A speech recognition method, comprising the steps of: receiving a speech waveform; performing a spectral analysis of the speech waveform and generating a set of syllabic predictions, each syllabic prediction including one or more predicted syllables, wherein the set of syllabic predictions includes a spectral score and timing alignment information of the one or more predicted syllables; sequentially performing a tonal analysis of the input speech waveform using the timing alignment information and generating tonal scores for each of the syllabic predictions; and combining the spectral score with the tonal score for each of the syllabic predictions in order to generate an output prediction.
2. The method of claim 1, wherein the spectral analysis step is performed using a fast Fourier transform algorithm.
3. The method of claim 1, wherein the spectral analysis step is performed using a mel frequency cepstral coefficients algorithm.
4. The method of claim 1, wherein the spectral analysis step is performed using a linear prediction coefficient algorithm.
5. The method of claim 1, wherein the spectral analysis step generates a sequence of data frames that include a multi-dimensional vector that describes the spectral content of the speech waveform.
6. The method of claim 5, wherein the spectral analysis step further comprises the step of: analyzing the multi-dimensional vector in the sequence of data frames and generating a feature vector for each data frame, the feature vector including the multi-dimensional vector and one or more additional dimensional vectors that describe a spectral feature of the speech waveform.
7. The method of claim 6, wherein the spectral feature is the energy of the speech waveform.
8. The method of claim 6, wherein the spectral feature is a differential calculation of the speech waveform.
9. The method of claim 6, further comprising the step of: comparing the feature vectors to a spectral model and computing a set of probability results.
10. The method of claim 9, wherein the spectral model is a Hidden Markov Model.
11. The method of claim 9, wherein the spectral analysis step further comprises the step of: comparing the set of probability results with a vocabulary in order to generate the set of syllabic predictions.
12. The method of claim 1, wherein the tonal analysis step further comprises the steps of: generating a fundamental frequency contour of the speech waveform using the timing alignment information from the spectral analysis step; extracting one or more tonal features from the fundamental frequency contour; and generating the tonal scores based on the one or more extracted tonal features.
13. The method of claim 12, wherein the one or more tonal features includes a ratio of the fundamental frequencies for two adjacent syllables in the speech waveform.
14. The method of claim 12, wherein the one or more tonal features includes a slope measurement of the fundamental frequency contour.
15. The method of claim 12, wherein the tonal features include a ratio of the fundamental frequencies for two adjacent syllables in the speech waveform and a slope measurement of the fundamental frequency contour.
16. The method of claim 12, wherein the generating the tonal scores step further comprises the steps of: providing a tonal model including a plurality of sub-models that describe a set of possible adjacent tones; and comparing the tonal features to the plurality of sub-models in order to generate the tonal score.
17. The method of claim 12, wherein the generating a fundamental frequency contour step further comprises the steps of: determining a center point within each syllable of the speech waveform using a beginning and ending point specified by the timing alignment information; determining the energy of the syllable at the center point; generating an analysis window for each syllable, wherein the analysis window is centered at the center point and is bounded on either side of the center point by calculating the points at which the energy of the syllable has decreased to a first predetermined percentage of the energy at the center point; and computing the fundamental frequency within the analysis window.
18. The method of claim 17, wherein the computing the fundamental frequency step further comprises the steps of: computing a difference function within the analysis window to generate at least one global minimum and one or more local minimums, wherein the global minimum has a value that is lower than all of the local minimums; and selecting the global minimum in order to compute the fundamental frequency.
19. The method of claim 18, wherein the computing the fundamental frequency step further comprises the steps of: computing an average frequency for a plurality of adjacent syllables using the fundamental frequencies computed from the selected global minimums; within each analysis window, selecting the local minimum that most closely corresponds to the average frequency; if the global minimum is less than the selected local minimum by a predetermined threshold level, then using the difference function for the global minimum in order to calculate the fundamental frequency, otherwise using the difference function for the selected local minimum in order to calculate the fundamental frequency.
20. The method of claim 19, wherein the computing the fundamental frequency step further comprises the steps of: expanding the analysis window for each syllable to a point where the energy of the syllable has decreased to a second predetermined percentage of the energy at the center point; and computing the fundamental frequency within the expanded analysis window.
21. The method of claim 20, wherein the computing the fundamental frequency step further comprises the steps of: computing a difference function within the expanded analysis window to generate at least one global minimum and one or more local minimums, wherein the global minimum has a value that is lower than all of the local minimums; and selecting the global minimum in order to compute the fundamental frequency.
22. The method of claim 21, wherein the computing the fundamental frequency step further comprises the steps of: computing an average frequency for a plurality of adjacent syllables using the fundamental frequencies computed from the selected global minimums; within each expanded analysis window, selecting the local minimum that most closely corresponds to the average frequency; if the global minimum is less than the selected local minimum by a predetermined threshold level, then using the difference function for the global minimum in order to calculate the fundamental frequency, otherwise using the difference function for the selected local minimum in order to calculate the fundamental frequency.
23. A speech recognition system, comprising: a spectral modeling block that analyzes a speech waveform and generates a plurality of predicted syllables based upon the spectral content of the speech waveform, wherein each of the predicted syllables includes an associated spectral score and timing alignment information indicating the duration of the syllable; and a tonal modeling block that sequentially analyzes the speech waveform using the timing alignment information from the spectral modeling block and generates a plurality of tone scores based upon the tonal content of the speech waveform for each of the predicted syllables.
24. The speech recognition system of claim 23, further comprising: a combination block for combining the spectral scores with the tone scores in order to generate an output prediction of the most likely syllable.
25. The speech recognition system of claim 23, wherein the spectral modeling block further comprises: a spectral analyzer for performing a spectral analysis of the speech waveform and for generating a multi-dimensional vector that describes the spectral content of the speech waveform.
26. The speech recognition system of claim 25, wherein the spectral analysis utilizes a fast Fourier transform algorithm.
27. The speech recognition system of claim 25, wherein the spectral analysis utilizes a mel frequency cepstral coefficients algorithm.
28. The speech recognition system of claim 25, wherein the spectral analysis utilizes a linear prediction coefficient algorithm.
29. The speech recognition system of claim 25, wherein the spectral modeling block further comprises: a feature extraction block for analyzing the multi-dimensional vector and for generating a feature vector, wherein the feature vector includes the multi-dimensional vector and one or more additional dimensional vectors that describe a spectral feature of the speech waveform.
30. The speech recognition system of claim 29, wherein the spectral feature is the energy of the speech waveform.
31. The speech recognition system of claim 29, wherein the spectral feature is a differential calculation of the speech waveform.
32. The speech recognition system of claim 29, wherein the spectral modeling block further comprises: a model scoring block for comparing the feature vectors to a spectral model and for computing a set of probability values; and a model database for storing the spectral model.
33. The speech recognition system of claim 32, wherein the spectral model is a Hidden Markov Model.
34. The speech recognition system of claim 32, wherein the spectral modeling block further comprises: an N-best search block for comparing the set of probability values with a vocabulary and for selecting a set of N most likely predicted syllables; and a vocabulary database for storing the vocabulary.
35. The speech recognition system of claim 23, wherein the tonal modeling block further comprises: an F0 estimation block for generating a fundamental frequency contour of the speech waveform using the timing alignment information from the spectral modeling block; a feature extraction block for extracting one or more tonal features from the fundamental frequency contour; and a model scoring block for generating the plurality of tone scores based on the one or more extracted tonal features.
36. The speech recognition system of claim 35, wherein the one or more tonal features includes a ratio of the fundamental frequencies for two adjacent syllables in the speech waveform.
37. The speech recognition system of claim 35, wherein the one or more tonal features includes a slope measurement of the fundamental frequency contour.
38. The speech recognition system of claim 35, wherein the tonal features include a ratio of the fundamental frequencies for two adjacent syllables in the speech waveform and a slope measurement of the fundamental frequency contour.
39. A system for analyzing a speech waveform, comprising: a spectral modeling branch for generating a spectral score; and a tonal modeling branch for generating a tonal score; wherein the spectral modeling branch generates timing alignment information that indicates the beginning and ending points for a plurality of syllables in the speech waveform and provides this timing alignment information to the tonal modeling branch in order to sequentially analyze the speech waveform.
40. A method of analyzing a speech waveform carrying a plurality of syllables, comprising the steps of: performing a spectral analysis on the speech waveform and generating one or more spectral scores for each syllable; performing a tonal analysis on the speech waveform and generating one or more tonal scores for each syllable, wherein the tonal scores are generated by comparing the fundamental frequencies of two or more adjacent syllables; and combining the spectral scores with the tonal scores to produce an output prediction.
41. A method of recognizing tonal information in a speech waveform, comprising the steps of: generating timing alignment information for a plurality of syllables in the speech waveform; determining a center point within each syllable of the speech waveform using a beginning and ending point specified by the timing alignment information; determining the energy of the syllable at the center point; generating an analysis window for each syllable, wherein the analysis window is centered at the center point and is bounded on either side of the center point by calculating the points at which the energy of the syllable has decreased to a first predetermined percentage of the energy at the center point; computing a fundamental frequency contour within the analysis window; extracting one or more tonal features from the fundamental frequency contour; and generating a plurality of tonal scores for each syllable based on the one or more extracted tonal features.
42. The method of claim 41, wherein the computing a fundamental frequency step further comprises the steps of: computing a difference function within the analysis window to generate at least one global minimum and one or more local minimums, wherein the global minimum has a value that is lower than all of the local minimums; and selecting the global minimum in order to compute the fundamental frequency.
43. The method of claim 42, wherein the computing a fundamental frequency step further comprises the steps of: computing an average frequency for a plurality of adjacent syllables using the fundamental frequencies computed from the selected global minimums; within each analysis window, selecting the local minimum that most closely corresponds to the average frequency; if the global minimum is less than the selected local minimum by a predetermined threshold level, then using the difference function for the global minimum in order to calculate the fundamental frequency, otherwise using the difference function for the selected local minimum in order to calculate the fundamental frequency.
44. The method of claim 43, wherein the computing a fundamental frequency step further comprises the steps of: expanding the analysis window for each syllable to a point where the energy of the syllable has decreased to a second predetermined percentage of the energy at the center point; and computing the fundamental frequency within the expanded analysis window.
45. The method of claim 44, wherein the computing a fundamental frequency step further comprises the steps of: computing a difference function within the expanded analysis window to generate at least one global minimum and one or more local minimums, wherein the global minimum has a value that is lower than all of the local minimums; and selecting the global minimum in order to compute the fundamental frequency.
46. The method of claim 45, wherein the computing the fundamental frequency step further comprises the steps of: computing an average frequency for a plurality of adjacent syllables using the fundamental frequencies computed from the selected global minimums; within each expanded analysis window, selecting the local minimum that most closely corresponds to the average frequency; if the global minimum is less than the selected local minimum by a predetermined threshold level, then using the difference function for the global minimum in order to calculate the fundamental frequency, otherwise using the difference function for the selected local minimum in order to calculate the fundamental frequency.